:orphan:

Core Basics 1: Train, Evaluate and Deploy a Classifier
======================================================

In this lesson we will learn how to train, evaluate and deploy
classifiers with Khiops.

Make sure you have installed Khiops and Khiops Visualization.

We start by importing Khiops and defining some helper functions:

.. code:: ipython3

    import os
    import platform
    import subprocess
    from khiops import core as kh

    # Define peek helper function
    def peek(file_path, n=10):
        """Shows the first n lines of a file"""
        with open(file_path, encoding="utf8", errors="replace") as file:
            for line in file.readlines()[:n]:
                print(line, end="")
        print("")

    # If there are any issues, you can print the Khiops status with the following command
    # kh.get_runner().print_status()

Training a Classifier
---------------------

We’ll train a classifier for the ``Iris`` dataset. This is a classical
dataset containing the data of different plants belonging to the genus
*Iris*. It contains 150 records, 50 for each of three variants of
*Iris*: *Setosa*, *Virginica* and *Versicolor*. Each record contains the
length and width of the sample’s petal and sepal. The standard task for
this dataset is to construct a classifier for the type of *Iris*, taking
the length and width characteristics as inputs.

To train a classifier with Khiops we use two types of files:

- A plain-text delimited data file (for example a ``csv`` file)
- A *dictionary* file which describes the schema of the above data table
  (``.kdic`` file extension)

Let’s save the locations of these files for the ``Iris`` dataset into
variables and then take a look at their contents:

.. code:: ipython3

    iris_kdic = os.path.join(kh.get_samples_dir(), "Iris", "Iris.kdic")
    iris_data_file = os.path.join(kh.get_samples_dir(), "Iris", "Iris.txt")

    print(f"Iris dictionary file: {iris_kdic}")
    peek(iris_kdic)

    print(f"Iris data file: {iris_data_file}\n")
    peek(iris_data_file)

.. parsed-literal::

    Iris dictionary file: /github/home/khiops_data/samples/Iris/Iris.kdic

    Dictionary Iris
    {
        Numerical SepalLength ;
        Numerical SepalWidth ;
        Numerical PetalLength ;
        Numerical PetalWidth ;
        Categorical Class ;
    };

    Iris data file: /github/home/khiops_data/samples/Iris/Iris.txt

    SepalLength SepalWidth PetalLength PetalWidth Class
    5.1 3.5 1.4 0.2 Iris-setosa
    4.9 3.0 1.4 0.2 Iris-setosa
    4.7 3.2 1.3 0.2 Iris-setosa
    4.6 3.1 1.5 0.2 Iris-setosa
    5.0 3.6 1.4 0.2 Iris-setosa
    5.4 3.9 1.7 0.4 Iris-setosa
    4.6 3.4 1.4 0.3 Iris-setosa
    5.0 3.4 1.5 0.2 Iris-setosa
    4.4 2.9 1.4 0.2 Iris-setosa

Note that the *Iris* variant information is in the column ``Class``. Now
let’s specify a directory to save our results:

.. code:: ipython3

    iris_results_dir = os.path.join("exercises", "Iris")
    print(f"Iris results directory: {iris_results_dir}")

.. parsed-literal::

    Iris results directory: exercises/Iris

We are now ready to train the classifier with the Khiops function
``train_predictor``. This function returns a tuple containing the
locations of two files:

- the modeling report (``AllReports.khj``): A JSON file containing
  information such as the informativeness of each variable, those
  selected for the model and performance metrics.
- the model’s *dictionary* file (``Modeling.kdic``): This file is an
  enriched version of the initial dictionary file that contains the
  model. It can be used to make predictions on new data.

.. code:: ipython3

    iris_report, iris_model_kdic = kh.train_predictor(
        iris_kdic,
        dictionary_name="Iris",
        data_table_path=iris_data_file,
        target_variable="Class",
        results_dir=iris_results_dir,
        max_trees=0,  # by default Khiops constructs 10 decision tree variables
    )
    print(f"Iris report file: {iris_report}")
    print(f"Iris modeling dictionary: {iris_model_kdic}")

.. parsed-literal::

    Iris report file: exercises/Iris/AllReports.khj
    Iris modeling dictionary: exercises/Iris/Modeling.kdic

You can verify that the result files were created in ``iris_results_dir``.
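
One way to run this check from Python is sketched below. The helper
``missing_files`` is a hypothetical name introduced here for illustration;
it is not part of the Khiops API and relies only on the standard library.

.. code:: ipython3

    import os

    # Hypothetical helper (not part of Khiops)
    def missing_files(paths):
        """Returns the paths that do not point to an existing regular file"""
        return [path for path in paths if not os.path.isfile(path)]

    # Possible usage with the variables returned by train_predictor:
    # an empty list would mean that both result files exist
    # print(missing_files([iris_report, iris_model_kdic]))
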
In the next sections, we’ll use the file at ``iris_report`` to assess the
model’s performance and the file at ``iris_model_kdic`` to deploy it.

Now we can see the report with the Khiops Visualization app:

.. code:: ipython3

    # To visualize uncomment the line below
    # kh.visualize_report(iris_report)

Exercise
~~~~~~~~

We’ll repeat the examples in this notebook with the ``Adult`` dataset.
It contains characteristics of the adult population in the USA such as
age, gender and education, and its task is to predict the variable
``class``, which indicates if the individual earns ``more`` or ``less``
than 50,000 dollars.

Let’s start by putting the paths for the ``Adult`` dataset into
variables:

.. code:: ipython3

    adult_kdic = os.path.join(kh.get_samples_dir(), "Adult", "Adult.kdic")
    adult_data_file = os.path.join(kh.get_samples_dir(), "Adult", "Adult.txt")

Print the file locations and use the function ``peek`` to list their contents
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: ipython3

    print(f"Adult dictionary file: {adult_kdic}")
    peek(adult_kdic)

    print(f"Adult data file: {adult_data_file}\n")
    peek(adult_data_file)

.. parsed-literal::

    Adult dictionary file: /github/home/khiops_data/samples/Adult/Adult.kdic

    Dictionary Adult
    {
        Categorical Label ;
        Numerical age ;
        Categorical workclass ;
        Numerical fnlwgt ;
        Categorical education ;
        Numerical education_num ;
        Categorical marital_status ;

    Adult data file: /github/home/khiops_data/samples/Adult/Adult.txt

    Label age workclass fnlwgt education education_num marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country class
    1 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States less
    2 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States less
    3 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States less
    4 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States less
    5 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba less
    6 37 Private 284582 Masters 14 Married-civ-spouse Exec-managerial Wife White Female 0 0 40 United-States less
    7 49 Private 160187 9th 5 Married-spouse-absent Other-service Not-in-family Black Female 0 0 16 Jamaica less
    8 52 Self-emp-not-inc 209642 HS-grad 9 Married-civ-spouse Exec-managerial Husband White Male 0 0 45 United-States more
    9 31 Private 45781 Masters 14 Never-married Prof-specialty Not-in-family White Female 14084 0 50 United-States more

We now save the results directory for this exercise:

.. code:: ipython3

    adult_results_dir = os.path.join("exercises", "Adult")
    print(f"Adult results directory: {adult_results_dir}")

.. parsed-literal::

    Adult results directory: exercises/Adult

Train a classifier for the ``Adult`` database
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Note that the name of the target variable is ``class`` (**in lower
case!**). Do not forget to set ``max_trees=0``.
Save the resulting file locations into the variables ``adult_report``
and ``adult_model_kdic`` and print them.

.. code:: ipython3

    adult_report, adult_model_kdic = kh.train_predictor(
        adult_kdic,
        dictionary_name="Adult",
        data_table_path=adult_data_file,
        target_variable="class",
        results_dir=adult_results_dir,
        max_trees=0,
    )
    print(f"Adult report file: {adult_report}")
    print(f"Adult modeling dictionary file: {adult_model_kdic}")

.. parsed-literal::

    Adult report file: exercises/Adult/AllReports.khj
    Adult modeling dictionary file: exercises/Adult/Modeling.kdic

Inspect the results with the Khiops Visualization app
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: ipython3

    # To visualize uncomment the line below
    # kh.visualize_report(adult_report)

Accessing a Classifier’s Basic Evaluation Metrics
-------------------------------------------------

We access the classifier’s evaluation metrics by loading the file at
``iris_report`` with the Khiops function
``read_analysis_results_file``:

.. code:: ipython3

    iris_results = kh.read_analysis_results_file(iris_report)
    print(type(iris_results))

.. parsed-literal::

    <class 'khiops.core.analysis_results.AnalysisResults'>

The resulting object is an instance of the ``AnalysisResults`` class.
The model evaluation reports are stored in its
``train_evaluation_report`` and ``test_evaluation_report`` attributes,
which are of class ``EvaluationReport``.

.. code:: ipython3

    iris_train_eval = iris_results.train_evaluation_report
    iris_test_eval = iris_results.test_evaluation_report
    print(type(iris_train_eval))
    print(type(iris_test_eval))

.. parsed-literal::

    <class 'khiops.core.analysis_results.EvaluationReport'>
    <class 'khiops.core.analysis_results.EvaluationReport'>

We access the default predictor’s metrics with the
``get_snb_performance`` method of the evaluation report objects:

.. code:: ipython3

    iris_train_performance = iris_train_eval.get_snb_performance()
    iris_test_performance = iris_test_eval.get_snb_performance()

These objects are of class ``PredictorPerformance`` and have
``accuracy`` and ``auc`` attributes for these metrics:

.. code:: ipython3

    print(f"Iris train accuracy: {iris_train_performance.accuracy}")
    print(f"Iris test accuracy: {iris_test_performance.accuracy}")
    print("")
    print(f"Iris train AUC: {iris_train_performance.auc}")
    print(f"Iris test AUC: {iris_test_performance.auc}")

.. parsed-literal::

    Iris train accuracy: 0.980952
    Iris test accuracy: 0.955556

    Iris train AUC: 0.997868
    Iris test AUC: 0.984362

Exercise
~~~~~~~~

Read the contents of the file at ``adult_report`` for the Adult analysis and print its type
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: ipython3

    adult_results = kh.read_analysis_results_file(adult_report)
    type(adult_results)

.. parsed-literal::

    khiops.core.analysis_results.AnalysisResults

Save the evaluation reports of the ``Adult`` classification to the variables ``adult_train_eval`` and ``adult_test_eval``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: ipython3

    adult_train_eval = adult_results.train_evaluation_report
    adult_test_eval = adult_results.test_evaluation_report

Show the model’s train and test accuracies and AUCs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: ipython3

    adult_train_performance = adult_train_eval.get_snb_performance()
    adult_test_performance = adult_test_eval.get_snb_performance()
    print(f"Adult train accuracy: {adult_train_performance.accuracy}")
    print(f"Adult test accuracy: {adult_test_performance.accuracy}")
    print("")
    print(f"Adult train AUC: {adult_train_performance.auc}")
    print(f"Adult test AUC: {adult_test_performance.auc}")

.. parsed-literal::

    Adult train accuracy: 0.869295
    Adult test accuracy: 0.865714

    Adult train AUC: 0.926145
    Adult test AUC: 0.921665

Deploying a Classifier
----------------------

We are going to deploy the ``Iris`` classifier we have just trained on
the same dataset (normally we would do this on new data). We saved the
model in the file at ``iris_model_kdic``.
This file is usually large and hard to read, so you should know what you
are doing before editing it. Just this time let’s take a quick look at
its contents:

.. code:: ipython3

    peek(iris_model_kdic, 25)

.. parsed-literal::

    #Khiops 10.3.0

    Dictionary SNB_Iris
    {
        Unused Numerical SepalLength ;
        Unused Numerical SepalWidth ;
        Unused Numerical PetalLength ;
        Unused Numerical PetalWidth ;
        Unused Categorical Class ;
        Unused Structure(DataGrid) VClass = DataGrid(ValueSetC("Iris-setosa", "Iris-versicolor", "Iris-virginica"), Frequencies(38, 32, 35)) ;
        Unused Structure(DataGrid) PPetalLength = DataGrid(IntervalBounds(3.15, 4.75, 5.15), ValueSetC("Iris-setosa", "Iris-versicolor", "Iris-virginica"), Frequencies(38, 0, 0, 0, 1, 26, 5, 0, 0, 0, 9, 26)) ; // DataGrid(PetalLength, Class)
        Unused Structure(DataGrid) PPetalWidth = DataGrid(IntervalBounds(0.75, 1.75), ValueSetC("Iris-setosa", "Iris-versicolor", "Iris-virginica"), Frequencies(38, 0, 0, 0, 31, 1, 0, 2, 33)) ; // DataGrid(PetalWidth, Class)
        Unused Structure(Classifier) SNBClass = SNBClassifier(Vector(0.3515625, 0.4375), DataGridStats(PPetalLength, PetalLength), DataGridStats(PPetalWidth, PetalWidth), VClass) ;
        Categorical PredictedClass = TargetValue(SNBClass) ;
        Unused Numerical ScoreClass = TargetProb(SNBClass) ;
        Numerical `ProbClassIris-setosa` = TargetProbAt(SNBClass, "Iris-setosa") ;
        Numerical `ProbClassIris-versicolor` = TargetProbAt(SNBClass, "Iris-versicolor") ;
        Numerical `ProbClassIris-virginica` = TargetProbAt(SNBClass, "Iris-virginica") ;
    };

Note that the modeling dictionary contains 4 used variables, that is,
variables not marked ``Unused``:

- ``PredictedClass`` : The class with the highest probability according
  to the model
- ``ProbClassIris-setosa``, ``ProbClassIris-versicolor``,
  ``ProbClassIris-virginica``: The probabilities of each class according
  to the model

The original target ``Class`` is marked ``Unused`` in this dictionary,
so it is not written to the output. The used variables will be the
columns of the output table when deploying the model:

.. code:: ipython3

    iris_deployment_file = os.path.join(iris_results_dir, "iris_deployment.txt")
    kh.deploy_model(
        iris_model_kdic,
        dictionary_name="SNB_Iris",
        data_table_path=iris_data_file,
        output_data_table_path=iris_deployment_file,
    )

    peek(iris_deployment_file)

.. parsed-literal::

    PredictedClass ProbClassIris-setosa ProbClassIris-versicolor ProbClassIris-virginica
    Iris-setosa 0.9884494887 0.008598869265 0.002951642068
    Iris-setosa 0.9884494887 0.008598869265 0.002951642068
    Iris-setosa 0.9884494887 0.008598869265 0.002951642068
    Iris-setosa 0.9884494887 0.008598869265 0.002951642068
    Iris-setosa 0.9884494887 0.008598869265 0.002951642068
    Iris-setosa 0.9884494887 0.008598869265 0.002951642068
    Iris-setosa 0.9884494887 0.008598869265 0.002951642068
    Iris-setosa 0.9884494887 0.008598869265 0.002951642068
    Iris-setosa 0.9884494887 0.008598869265 0.002951642068

Exercise
~~~~~~~~

Use the ``deploy_model`` function to deploy the model stored in the file at ``adult_model_kdic``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Which columns are deployed?

.. code:: ipython3

    adult_deployment_file = os.path.join(adult_results_dir, "adult_deployment.txt")
    kh.deploy_model(
        adult_model_kdic,
        dictionary_name="SNB_Adult",
        data_table_path=adult_data_file,
        output_data_table_path=adult_deployment_file,
    )

    peek(adult_deployment_file)

.. parsed-literal::

    Predictedclass Probclassless Probclassmore
    less 0.9999926658 7.33418716e-06
    more 0.4122763795 0.5877236205
    less 0.9624691952 0.03753080482
    less 0.9158716208 0.08412837917
    less 0.5717571015 0.4282428985
    more 0.2594836411 0.7405163589
    less 0.9939376151 0.006062384897
    more 0.4223655109 0.5776344891
    more 0.001798128 0.998201872
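
To answer the question about the deployed columns without reading the
output by eye, we could list the header of the deployed file. The sketch
below assumes that the file is tab-separated and starts with a header
line. The helper ``deployed_columns`` is a hypothetical name introduced
here; it is not part of the Khiops API.

.. code:: ipython3

    import csv

    # Hypothetical helper (not part of Khiops)
    # Assumption: the file is tab-separated and its first line is a header
    def deployed_columns(file_path, delimiter="\t"):
        """Returns the column names found in the first line of a delimited file"""
        with open(file_path, encoding="utf8", errors="replace", newline="") as file:
            return next(csv.reader(file, delimiter=delimiter), [])

    # Possible usage with the file written by deploy_model:
    # print(deployed_columns(adult_deployment_file))

If the file uses another separator, it can be passed through the
``delimiter`` parameter.
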